
Feature/1692 api v3 runs #2073

Merged
merged 191 commits into from
Jun 23, 2022

Conversation

@dk1844 dk1844 (Contributor) commented May 27, 2022

This PR implements Runs v3 API #1692.

  • Pagination functionality is not included.
  • Endpoint descriptions are below.
  • Endpoints are covered by an integTest; the RunRepositoryV3 is covered by a separate integTest, too.
  • Special attention was given to efficient aggregation queries on the run collection.

Endpoint details

GET /runs[?startDate=yyyy-MM-dd|sparkAppId=...|uniqueId=...]

  • returns (wrapped) list of RunSummaries (Run without splineRef and controlMeasure)
  • form of "last of each": runs are grouped by datasetName and datasetVersion, and only the Run with the latest runId in each group is returned

GET /runs/{datasetName}[?startDate=yyyy-MM-dd]

  • same as above

GET /runs/{datasetName}/{datasetVersion}[?startDate=yyyy-MM-dd]

  • returns (wrapped) list of RunSummaries (Run without splineRef and controlMeasure)
  • no latest-of-each filtering here; objects with all runIds are returned

PUT /runs/{datasetName}/{datasetVersion}

  • payload = Run
  • the response contains the Location header with the URL of the newly created entity

GET /runs/{datasetName}/{datasetVersion}/{runId}

  • returns the Run object (contains splineRef and controlMeasure)
  • allows latest in place of a numeric runId.

PUT /runs/{datasetName}/{datasetVersion}/{runId}

  • payload = RunStatus (logically: updated state, optional error)
  • there are currently no restrictions on state transitions (details here)

GET /runs/{datasetName}/{datasetVersion}/{runId}/checkpoints

  • allows latest in place of a numeric runId.

POST /runs/{datasetName}/{datasetVersion}/{runId}/checkpoints

  • payload = Checkpoint
  • the new checkpoint is checked for a duplicate name and refused with a failed validation in such a case
  • the response contains the Location header with the URL of the newly created entity

GET /runs/{datasetName}/{datasetVersion}/{runId}/checkpoints/{checkpointName}

  • allows latest in place of a numeric runId

GET /runs/{datasetName}/{datasetVersion}/{runId}/metadata

  • allows latest in place of a numeric runId.
  • there is no endpoint for GET .../metadata/{metadataName}, because it does not seem to make much sense (but I am open to suggestions on this part)
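
Several of the GET endpoints above accept latest in place of a numeric runId. A minimal sketch of how such a path segment could be resolved (illustration only, in Python for brevity; the function names are hypothetical and not the PR's actual Scala code):

```python
# Illustration only: a runId path segment may be a number or the literal "latest".
def parse_run_id_segment(segment):
    """Return an int, the marker "latest", or None for an invalid segment."""
    if segment == "latest":
        return "latest"
    try:
        return int(segment)
    except ValueError:
        return None

def resolve_run_id(selector, known_run_ids):
    """Resolve a parsed selector against the runIds known for a dataset version."""
    if selector == "latest":
        return max(known_run_ids) if known_run_ids else None
    return selector if selector in known_run_ids else None
```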

Consideration for latest-of-each RunSummaries in MongoDB

The latest-of-each retrieval poses a design conundrum. I have considered these options for implementing it:

  1. grouping with max(runId) and joining back to the original documents' data - using $lookup just as the V2 repository does it
  2. sorting all documents before grouping and then grouping with the $first aggregate accumulator (guaranteed to be the max runId because of the pre-sorting)
  3. using Mongo's aggregate while storing the documents within the group ($push) and then filtering by the max(runId)

The first option seems like overkill, but I admit I have not tried it. For options 2 and 3, I have written basic implementations of such aggregate queries and measured them rudimentarily on our UAT environment, which holds ~450k run records.

Option 3 appeared to take about twice as long (~30s) as option 2 (~15s), provided that an appropriate index is present in the collection for the pre-sorting. Surprisingly, this descending index on runId ({ "runId": -1 }) was present on our UAT, but it must have been added manually, because our default setup does not introduce this index to the run_v1 collection. Thus, in the code, I have added it so that it will be present for future MongoDB provisionings.

Option 2 mongo script used:

db.getCollection('run_v1').aggregate([
  { $sort: { "runId": -1 } }, // because of this sorting, the $first below yields max runId in the groups
  {
    $group: {
      _id: {
        "dataset": "$dataset",
        "datasetVersion": "$datasetVersion"
      },
      datasetName: { $first: "$dataset" },
      datasetVersion: { $first: "$datasetVersion" },
      runId: { $first: "$runId" },
      status: { $first: "$runStatus.status" },
      startDateTime: { $first: "$startDateTime" },
      uniqueId: { $first: "$uniqueId" }
    }
  },
  {
    $project: {
      "_id": 0
    }
  }
])

Option 3 mongo script used:

db.getCollection('run_v1').aggregate([
  {
    $group: {
      _id: {
        "dataset": "$dataset",
        "datasetVersion": "$datasetVersion"
      },
      maxRunId: { $max: "$runId" },
      docs: {
        $push: '$$ROOT' // saving the documents
      }
    }
  },
  {
    $project: {
      _id: 0,
      doc: {
        "$setDifference": [
          {
            "$map": {
              "input": "$docs",
              "as": "mapped",
              "in": {
                "$cond": [
                  { "$eq": ["$maxRunId", "$$mapped.runId"] }, // keeping only those with maxRunId here
                  "$$mapped",
                  false
                ]
              }
            }
          },
          [false]
        ]
      }
    }
  },
  { $unwind: '$doc' },
  {
    $project: {
      datasetName: "$doc.dataset",
      datasetVersion: "$doc.datasetVersion",
      runId: "$doc.runId",
      status: "$doc.runStatus.status",
      startDateTime: "$doc.startDateTime",
      uniqueId: "$doc.uniqueId"
    }
  }
], { allowDiskUse: true })
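
As a sanity check of the semantics, the latest-of-each computation that both scripts perform can be expressed over in-memory records (Python sketch for illustration only; the field names mirror the projection above, the status values are made up):

```python
# Illustration only: the "latest of each" semantics of the aggregation scripts,
# i.e. group by (dataset, datasetVersion) and keep the run with the max runId.
def latest_of_each(runs):
    groups = {}
    for run in runs:
        key = (run["dataset"], run["datasetVersion"])
        if key not in groups or run["runId"] > groups[key]["runId"]:
            groups[key] = run
    return list(groups.values())

sample = [
    {"dataset": "dsA", "datasetVersion": 1, "runId": 1, "status": "failed"},
    {"dataset": "dsA", "datasetVersion": 1, "runId": 3, "status": "allSucceeded"},
    {"dataset": "dsA", "datasetVersion": 2, "runId": 1, "status": "running"},
    {"dataset": "dsB", "datasetVersion": 1, "runId": 7, "status": "allSucceeded"},
]
summaries = latest_of_each(sample)
```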

benedeki and others added 30 commits March 13, 2021 23:54
* 1422 and 1423 Remove HDFS and Oozie from Menas

* #1422 Fix HDFS location validation

* #1424 Add Menas Dockerfile

* #1416 hadoop-aws 2.8.5 + s3 aws sdk 2.13.65 compiles.

* #1416 - enceladus on S3:

 - all directly-hdfs touching stuff disabled (atum, performance measurements, info files, output path checking)

# Add menasfargate into hosts
sudo nano /etc/hosts
# paste
20.0.63.69 menasfargate
# save & exit (ctrl+O, ctrl+X)

# Running standardization works via:
spark-submit --class za.co.absa.enceladus.standardization.StandardizationJob --conf "spark.driver.extraJavaOptions=-Dmenas.rest.uri=http://menasfargate:8080 -Dstandardized.hdfs.path=s3://euw1-ctodatadev-dev-bigdatarnd-s3-poc/enceladusPoc/ao-hdfs-data/stdOutput/standardized-{0}-{1}-{2}-{3}" ~/enceladusPoc/spark-jobs-2.11.0-SNAPSHOT.jar --menas-credentials-file ~/enceladusPoc/menas-credentials.properties --dataset-name dk_test1_emr285 --raw-format json --dataset-version 1 --report-date 2019-11-27 --report-version 1 2> ~/enceladusPoc/stderr.txt

* #1416 - enceladus on S3 - (crude) conformance works on s3 (s3 std input, s3 conf output)

 - all directly-hdfs touching stuff disabled (atum, performance measurements, info files, output path checking)

# Add menasfargate into hosts
sudo nano /etc/hosts
# paste
20.0.63.69 menasfargate
# save & exit (ctrl+O, ctrl+X)

# Running conformance works via:
spark-submit --class za.co.absa.enceladus.conformance.DynamicConformanceJob --conf "spark.driver.extraJavaOptions=-Dmenas.rest.uri=http://menasfargate:8080 -Dstandardized.hdfs.path=s3://euw1-ctodatadev-dev-bigdatarnd-s3-poc/enceladusPoc/ao-hdfs-data/stdOutput/standardized-{0}-{1}-{2}-{3}" ~/enceladusPoc/spark-jobs-2.11.0-SNAPSHOT.jar --menas-credentials-file ~/enceladusPoc/menas-credentials.properties --dataset-name dk_test1_emr285 --dataset-version 1 --report-date 2019-11-27 --report-version 1 2> ~/enceladusPoc/conf-log.txt

* ref issue = 1416

* related test cases ignored (issue reference added)

* PR updates

* Merge spline 0.5.3 into aws-poc

* Update spline to 0.5.4 for AWS PoC

* #1503 Remove HDFS url Validation

This is a temporary solution. We currently experiment with
many forms of URLs, and having a regex there now slows us down.

* New dockerfile - smaller image

* s3 persistence (atum, sdk fs usage, ...) (#1526)

#1526 
* FsUtils divided into LocalFsUtils & HdfsUtils
* PathConfigSuite update
* S3FsUtils with tail-recursive pagination accumulation - now generic with optional short-circuit breakOut
TestRunnerJob updated to manually cover the cases - should serve as a basis for tests
* HdfsUtils replace by trait DistributedFsUtils (except for MenasCredentials loading & nonSplittable splitting)
* using final version of s3-powered Atum (3.0.0)
* mockito-update version update, scalatest version update
* S3FsUtilsSuite: exists, read, sizeDir(hidden, non-hidden, recursive), non-splittable (simple, recursive with breakOut), delete (recursive), version find (simple - empty, recursive)
* explicit stubbing fix for hyperdrive

* Feature/1556 file access PoC using Hadoop FS API (#1586)

* s3 using hadoop fs api
* s3 sdk usage removed (pom, classes, tests)
* atum final version 3.1.0 used
* readStandardizationInputData(... path: String)(implicit ... fs: FileSystem) -> readStandardizationInputData(input: PathWithFs)

* 1554 Tomcat with TLS in Docker container (#1585)

* #1554 Tomcat with TLS container

* #1554 Added envoy config + enabling running unencrypted container

* #1499 Add authentication to /lineage + update spline to 0.5.5

* #1618 - fixes failing spline 0.5.5 integration by providing compatible commons library version. Test-ran on EMR. (#1619)

* #1612 Separation start

* #1612 Updated DAO for spark-jobs

* #1612 Fixed spline integration and schema, removed redundant code

* #1612 Fixed tests, removed unused dependency

* #1612 Added back dependency

* WIP fixing merge issues

* * Merge compiles
* Tests pass
* Depends on ATUM 3.1.1-SNAPSHOT (the bugfix for AbsaOSS/atum#48)

* #1612 Removed Spring from menas-web, enabled building war and static resources. Removed version subpath in menas-web + added theme dependencies in repo

* #1612 Cookies + updated lineage

* * put back HDFS browser
* put back Oozie
* downgraded Spline

* * AWS SDK Exclusion

* #1612 Included HDFSFolder + missing Oozie parts

* * New ATUM version

* * Adding missing files

* #1612 menas-web on nginx container and passing API_URL

* #1612 Working TLS on nginx, resources not included in code

* 1622: Merge of aws-poc to develop branch
* Addressed issues identified by reviewers

* * comments improvement

* 1434 Add new way of serving properties to Docker

* #1612 Building using ui5 + reused /api route

* #1612 Project version

* #713 Add favicon

* #1612 Merges

* #1612 pom parent version

* #1648 Fix war deployment + adding back spline to menas

* #1612 other fixes

* #1612 added pom package.json version sync

* #1612 newline

* #1612 fix version sync + cleaning dist

* 1648 merge to develop

* 1648 merge fix

* 1648 Fixes schema upload

* 1648 Fixes schema registry request

* 1648 pom version

* 1612 add docker build

* #601 Swagger 2 PoC

* #601 Swagger 2 PoC

* #601 Swagger 2 PoC

* #1648 Updating menas-web to 3.0

* #1612 Updated npm project versions + mvn plugin

* #1612 license_check.yml

* #1612 licence check fix

Co-authored-by: Saša Zejnilović <[email protected]>
Co-authored-by: Daniel Kavan <[email protected]>
Co-authored-by: Jan Scherbaum <[email protected]>
Co-authored-by: David Benedeki <[email protected]>
% Conflicts:
%	data-model/src/main/scala/META-INF/MANIFEST.MF
%	menas-web/ui/components/dataset/conformanceRule/MappingConformanceRule/targetSchemaFieldSelector.fragment.xml
%	menas/ui/index.html
…velop-ver.30

# Conflicts:
#	menas/src/main/scala/za/co/absa/enceladus/menas/MvcConfig.scala
merging develop into develop-ver3.0 (via  feature/merging-develop-ver.30 branch)
KafkaErrorSenderPluginSuite test fix for SourcePhase.{Standardization, Conformance} capitalization.
* #1769 rename menas to rest-api

* #1769 README update and other feedback
* #1733 final version 0.6.0 of spline spark agent used
* #1733 cherrypicked version: commons 0.0.27, atum 3.5.0, spline agent 0.6.0-snapshot
* #1733 PR touchup: maven.shade.version version revert to 3.2.1
* #1770 Rename menas-web to menas
% Conflicts:
%	dao/pom.xml
%	data-model/pom.xml
%	examples/pom.xml
%	menas/pom.xml
%	migrations-cli/pom.xml
%	migrations/pom.xml
%	plugins-api/pom.xml
%	plugins-builtin/pom.xml
%	pom.xml
%	spark-jobs/pom.xml
%	utils/pom.xml
Spline 0.6 integration
% Conflicts:
%	menas/pom.xml
%	menas/ui/index.html
%	pom.xml
%	rest-api/src/main/scala/za/co/absa/enceladus/rest_api/WebSecurityConfig.scala
%	rest-api/src/main/scala/za/co/absa/enceladus/rest_api/controllers/DatasetController.scala
% Conflicts:
%	menas/pom.xml
…elate to rest_api, not menas: `/3rdParty/**` should not be added anymore as it is menas-related)
* Adding back Menas module, that somehow got omitted.
…lop-ver-3

Merging Release 2.23.0 into develop ver 3
Base automatically changed from bugfix/2075-validation-order to develop June 7, 2022 10:53
% Conflicts:
%	README.md
%	spark-jobs/src/main/scala/za/co/absa/enceladus/common/CommonJobExecution.scala
.find[BsonDocument](filter)
.headOption()
.map(_.map(bson => SerializationUtils.fromJson[Run](bson.toJson)))
// why not just .find[Run]? Because Run.RunStatus.RunState is a Scala enum that does not play nice with bson-serde
Collaborator

Love the comment 👍

import za.co.absa.enceladus.model.{Run, SplineReference, Validation}

import scala.concurrent.Future
import scala.util.{Failure, Success, Try}

@Service
class RunService @Autowired()(val mongoRepository: RunMongoRepository)
@Service("runService") // by-name qualifier: making V2 autowiring un-ambiguous
Collaborator

If you have time, could you explain this to me? (educating 😉)

@dk1844 dk1844 (Contributor, Author) Jun 15, 2022

RunServiceV3 extends RunService. This has a few (expected) effects:

  • Wherever a field of type RunServiceV3 is needed, @Autowired has only one option to use - RunServiceV3.
  • Wherever a field of type RunService is needed, @Autowired has two options to use - RunService or RunServiceV3 - and thus fails with an ambiguity exception (No qualifying bean of type 'za.co.absa.enceladus.rest_api.services.RunService' available: expected single matching bean but found 2: runService,runServiceV3). Therefore it needs to be disambiguated somehow (some details here, but basically: by default, Spring resolves @Autowired entries by type).

There are multiple ways to hint at the desired wiring, e.g. marking one of the implementations with @Primary. The option I have used here is marking the bean with a qualifier -- in my case, one that matches the field name where it is requested. So, I have:

@Service("runService") // by-name qualifier: making V2 autowiring un-ambiguous
class RunService()

@Service
class RunServiceV3() extends RunService

class LandingPageController @Autowired() (
  ...,
  runService: RunService, // matches the qualifier of RunService by field name => no ambiguity, RunService is wired.
)

In case you cannot match the name of the qualifier with the field name (unable to change either for some reason), this can be solved by adding @Qualifier(qualifierName) to the property, as shown with @Qualifier("fooFormatter") at https://www.baeldung.com/spring-autowire#1-autowiring-by-qualifier. So it would look like this:

@Service("runService")
class RunService()

@Service
class RunServiceV3() extends RunService

class LandingPageController @Autowired() (
  ...,
  @Qualifier("runService") // matches the qualifier of RunService => no ambiguity, RunService is wired.
  runServiceWithADifferentName: RunService, 
)

@@ -42,6 +44,7 @@ class RunRepositoryIntegrationSuite extends BaseRepositoryTest {
private val runFixture: RunFixtureService = null

@Autowired
@Qualifier("runMongoRepository") // to correctly wire V2 runMongoRepository
Collaborator

Another education request, please.

@dk1844 dk1844 (Contributor, Author) Jun 15, 2022

This is the same case as #2073 (comment)
A field of type RunMongoRepository can be wired with more than one class, so the qualifier here helps to resolve that -- when the field of this type is named runMongoRepository or accompanied by @Qualifier("runMongoRepository").

@Zejnilovic (Contributor)

Question 1

On non-existing URLs I do not get a 404 but rather an empty array:
/api-v3/runs/DeleteMeXYZ -> the dataset does not exist
/api-v3/runs/DeleteMe/1223 -> the version does not exist
These return 200 and an empty array.

I can see why an empty array is returned if the dataset exists but has no runs for the name or the version, but if the dataset name or the name-and-version combination does not even exist in the Dataset collection, then it is misleading.

Thoughts?

@benedeki (Collaborator)

Question 1

On non-existing URLs I do not get a 404 but rather an empty array. /api-v3/runs/DeleteMeXYZ -> the dataset does not exist /api-v3/runs/DeleteMe/1223 -> the version doesn't exist These return 200 and an empty array.

I can see why return an empty array if the dataset exists but has no runs for the Name or the Version, but if the dataset name or combination of name and version don't even exist in Dataset collection, then it is misleading.

Thoughts?

I agree: 404 in case the Dataset & Version do not exist;
an empty array if there are no runs for an existing Dataset & Version.
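
The behavior agreed on here (404 when the dataset or version does not exist, 200 with an empty array when it exists but has no runs) can be sketched as plain logic, independent of Spring; a Python illustration with hypothetical names standing in for the real repository lookups:

```python
# Illustration only: 404 for an unknown dataset, 200 with a possibly-empty
# list for a known one. existing_datasets/runs_by_dataset stand in for the
# real repository lookups.
def get_run_summaries(dataset_name, existing_datasets, runs_by_dataset):
    if dataset_name not in existing_datasets:
        return 404, []                                 # unknown dataset -> 404
    return 200, runs_by_dataset.get(dataset_name, [])  # known dataset -> 200, maybe []
```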

@Zejnilovic (Contributor)

Suggestion 2

on Update status, that is PUT /runs/{datasetName}/{datasetVersion}/{runId}, we return a plaintext message. I would rather go with

{
  "status": "updated",
  "message": "message"
}

But this already smells of the JSON envelope. So maybe we keep it for the next PR

… instead of lexicographical comparison as before (which does not work for DD-MM-YYYY). Input changed for controllers - now accepts YYYY-MM-DD instead of DD-MM-YYYY from the client.

Tests added + existing adjusted.
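
The commit message above refers to date comparison: lexicographic string comparison is only chronologically correct for the ISO YYYY-MM-DD form, not for DD-MM-YYYY. A quick demonstration:

```python
# Lexicographic string comparison is only chronologically correct for the ISO form.
jan_iso, feb_iso = "2022-01-31", "2022-02-01"  # 31 Jan vs 01 Feb, YYYY-MM-DD
jan_dmy, feb_dmy = "31-01-2022", "01-02-2022"  # the same dates in DD-MM-YYYY

iso_order_ok = jan_iso < feb_iso  # string order matches chronological order
dmy_order_ok = jan_dmy < feb_dmy  # the leading '3' vs '0' decides the comparison
```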
Zejnilovic previously approved these changes Jun 16, 2022

@Zejnilovic Zejnilovic left a comment

code reviewed
built
run

@Zejnilovic Zejnilovic added the PR:tested Only for PR - PR was tested by a tester (person) label Jun 16, 2022
@AdrianOlosutean AdrianOlosutean (Contributor) left a comment

Code review

…tasetVersion} are checked for ds existence => 404
@dk1844 dk1844 dismissed stale reviews from AdrianOlosutean and Zejnilovic via 3db75dd June 20, 2022 09:12
@dk1844 dk1844 (Contributor, Author) commented Jun 20, 2022

Hi guys,

Question 1

On non-existing URLs I do not get a 404 but rather an empty array. /api-v3/runs/DeleteMeXYZ -> the dataset does not exist /api-v3/runs/DeleteMe/1223 -> the version doesn't exist These return 200 and an empty array.
I can see why return an empty array if the dataset exists but has no runs for the Name or the Version, but if the dataset name or combination of name and version don't even exist in Dataset collection, then it is misleading.
Thoughts?

I agree. 404 in case Dataset & Version does not exists. Empty array if no runs for existing Dataset & Version

This I have fixed as suggested. For /api-v3/runs/justDatasetNameWithoutVersion, I have added a check for the latest version, so this case is covered as well and produces a 404 if no version is found for the dataset.
Tests added.

Suggestion 2

on Update status, that is PUT /runs/{datasetName}/{datasetVersion}/{runId}, we return plaintext message. I would rather go with

{
  "status": "updated",
  "message": "message"
}

But this already smells of the JSON envelope. So maybe we keep it for the next PR

Yes, it does smell a bit like it should be part of #2062. So far, I have just wrapped the response in an object with a message field (so it is valid JSON with the message, but without other fields like status); for the rest of the effort, we should solve this in #2062.
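
The interim response shape described here (a JSON object with just a message field, leaving status and the full envelope to #2062) amounts to something like this Python sketch (illustration only; the real code would use the project's Scala JSON serde):

```python
# Illustration only: the interim response shape - a JSON object with just a
# "message" field (a fuller envelope with "status" etc. is left to #2062).
import json

def wrap_message(message):
    return json.dumps({"message": message})
```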

@sonarcloud sonarcloud bot commented Jun 21, 2022

SonarCloud Quality Gate failed.

0 Bugs
0 Vulnerabilities
0 Security Hotspots
19 Code Smells

No coverage information
9.3% Duplication

@AdrianOlosutean AdrianOlosutean (Contributor) left a comment

LGTM

@dk1844 dk1844 merged commit c398997 into develop Jun 23, 2022
@dk1844 dk1844 deleted the feature/1692-api-v3-runs branch June 23, 2022 09:09
Successfully merging this pull request may close these issues: Experimental API 2 - Runs